Building & Preparing a RSQLite Database Using R

A step-by-step guide for constructing an SQLite database using Airbnb data.

Nils Dosaj Mikkelsen, Jose Lama https://example.com/norajones
2022-03-27

Introduction

For this project, we demonstrate how to separate, clean and upload host and listing data taken from Airbnb for the city of Bangkok in Thailand.

The data for our project can be found here

Libraries

The following libraries are used for data cleaning and database construction

# Data cleaning libraries
library(readr)
library(tidyr)
library(stringr)
library(tibble)
library(dplyr) 

# Database libraries
library(DBI)
library(RSQLite)

# ER Diagram libraries
library(dm)

Data/Database Connection

Begin by loading the downloaded data into R using the read_csv() command from the readr library.

data <- read_csv("../data/listings.csv.gz")

We also need to form a database connection object:

con <- dbConnect(RSQLite::SQLite(), "airdb.SQLite")

Useful Functions

Before moving on to the data processing stage, we first define a few useful functions.

The remove_live_database() function

While we are actively working on our database construction, we may find it useful to periodically disconnect and rebuild our database from scratch. Having to close R, delete the airdb.SQLite file and reopen our session repeatedly can be tedious. To work around this, we use the following remove_live_database() function to disconnect and delete any current working database while R is still open:

remove_live_database <- function(con){
    if(file.exists("airdb.SQLite")){
        if (exists("con")){
            dbDisconnect(con)   
        }
        file.remove("airdb.SQLite")
    }
}

The insert_to_sql() function

In order to facilitate the insertion of data into our RSQLite database, we the following insert_to_sql() function. This function performs a number of operations prior to insertion:

  1. Create a vector of the column names in provided the data argument.
  2. Makes sure that all NA out values are in the correct form. For all non NA values, replace the double quotes with single quotes and bookend all strings with double quotes.
  3. Join all column data into a single string, transform NA values to NULL and trim all additional whitespace.
  4. Prepend and append each string with a left and right parentheses respectively before joining all values into a single string and removing all \\’s. Finally, use the paste0() function to convert our string into a query prior to insertion in our database’s corresponding relevant table.
  5. Insert the now prepared data into the database.
insert_to_sql <- function(con, table, data){
    
    
    # (1)
    column_name <- paste(names(data), collapse = ", ")
    
    # (2)
    data_new <- data %>% 
        mutate_if(is.character, function(x) ifelse(is.na(x), NA,  x %>% 
                                str_replace_all('"', "'") %>% # Replace " with '
                                paste0('"', . , '"') # Add " before and after string
        )
        )
    
    value_data <- apply(data_new, MARGIN = 1,
                        function(x) x %>%
                            paste(collapse = ",") %>% # Join all column into single string
                            str_replace_all("\\bNA\\b", "NULL")  %>% # Create NULL from NA
                            str_trim() # remove unnecessary whitespace
    )
    # (4)
    query_value <- paste(value_data) %>% 
        paste0("(", ., ")") %>% # Add bracket before and after string
        paste(collapse = ", ") %>% # Join all values into single string
        str_remove_all("\\\\") %>% # Remove \\ from string
        paste0("INSERT INTO ", table, "(", column_name, ") VALUES ", .)
    
    # (5)
    dbSendQuery(con, query_value)
}

Data

The data used in for this project is provided by Airbnb and can be found here. To get started, first download the listings.csv.gz from the previous link. In order to help streamline the process of constructing our database, it is beneficial to split the .csv file into a listing table and a host_info table. The E/R (Entity/Relationship) diagram in the next section shows how these two tables are related and the attributes they each hold.

E/R Diagram

From here, we will work on separating and cleaning the two respective tables prior to insertion in our database.

Initial Data

We can examine our initial data using the glimpse() command from the tibble library.

data %>% glimpse()
Rows: 17,040
Columns: 74
$ id                                           <dbl> 27934, 27979, 2~
$ listing_url                                  <chr> "https://www.ai~
$ scrape_id                                    <dbl> 2.021122e+13, 2~
$ last_scraped                                 <date> 2021-12-24, 20~
$ name                                         <chr> "Nice room with~
$ description                                  <chr> "Our cool and c~
$ neighborhood_overview                        <chr> "It is very cen~
$ picture_url                                  <chr> "https://a0.mus~
$ host_id                                      <dbl> 120437, 120541,~
$ host_url                                     <chr> "https://www.ai~
$ host_name                                    <chr> "Nuttee", "Emy"~
$ host_since                                   <date> 2010-05-08, 20~
$ host_location                                <chr> "Bangkok", "Ban~
$ host_about                                   <chr> "Hi All, I am n~
$ host_response_time                           <chr> "N/A", "N/A", "~
$ host_response_rate                           <chr> "N/A", "N/A", "~
$ host_acceptance_rate                         <chr> "N/A", "N/A", "~
$ host_is_superhost                            <lgl> FALSE, FALSE, F~
$ host_thumbnail_url                           <chr> "https://a0.mus~
$ host_picture_url                             <chr> "https://a0.mus~
$ host_neighbourhood                           <chr> "Victory Monume~
$ host_listings_count                          <dbl> 2, 2, 1, 1, 1, ~
$ host_total_listings_count                    <dbl> 2, 2, 1, 1, 1, ~
$ host_verifications                           <chr> "['email', 'pho~
$ host_has_profile_pic                         <lgl> TRUE, TRUE, TRU~
$ host_identity_verified                       <lgl> TRUE, FALSE, FA~
$ neighbourhood                                <chr> "Samsen Nai, Ba~
$ neighbourhood_cleansed                       <chr> "Ratchathewi", ~
$ neighbourhood_group_cleansed                 <lgl> NA, NA, NA, NA,~
$ latitude                                     <dbl> 13.75983, 13.66~
$ longitude                                    <dbl> 100.5413, 100.6~
$ property_type                                <chr> "Entire condomi~
$ room_type                                    <chr> "Entire home/ap~
$ accommodates                                 <dbl> 3, 2, 2, 2, 2, ~
$ bathrooms                                    <lgl> NA, NA, NA, NA,~
$ bathrooms_text                               <chr> "1.5 baths", "1~
$ bedrooms                                     <dbl> 1, 1, 1, 1, 1, ~
$ beds                                         <dbl> 1, 2, 1, 1, 1, ~
$ amenities                                    <chr> "[\"Wifi\", \"H~
$ price                                        <chr> "$1,845.00", "$~
$ minimum_nights                               <dbl> 3, 1, 60, 5, 1,~
$ maximum_nights                               <dbl> 90, 730, 730, 3~
$ minimum_minimum_nights                       <dbl> 3, 1, 60, 5, 1,~
$ maximum_minimum_nights                       <dbl> 3, 1, 60, 5, 1,~
$ minimum_maximum_nights                       <dbl> 90, 730, 730, 3~
$ maximum_maximum_nights                       <dbl> 90, 730, 730, 3~
$ minimum_nights_avg_ntm                       <dbl> 3, 1, 60, 5, 1,~
$ maximum_nights_avg_ntm                       <dbl> 90, 730, 730, 3~
$ calendar_updated                             <lgl> NA, NA, NA, NA,~
$ has_availability                             <lgl> TRUE, TRUE, TRU~
$ availability_30                              <dbl> 27, 23, 29, 11,~
$ availability_60                              <dbl> 57, 53, 59, 11,~
$ availability_90                              <dbl> 87, 83, 89, 11,~
$ availability_365                             <dbl> 362, 358, 364, ~
$ calendar_last_scraped                        <date> 2021-12-24, 20~
$ number_of_reviews                            <dbl> 65, 0, 0, 1, 0,~
$ number_of_reviews_ltm                        <dbl> 0, 0, 0, 0, 0, ~
$ number_of_reviews_l30d                       <dbl> 0, 0, 0, 0, 0, ~
$ first_review                                 <date> 2012-04-07, NA~
$ last_review                                  <date> 2020-01-06, NA~
$ review_scores_rating                         <dbl> 4.85, NA, NA, 4~
$ review_scores_accuracy                       <dbl> 4.95, NA, NA, 5~
$ review_scores_cleanliness                    <dbl> 4.81, NA, NA, 3~
$ review_scores_checkin                        <dbl> 4.97, NA, NA, 2~
$ review_scores_communication                  <dbl> 4.91, NA, NA, 4~
$ review_scores_location                       <dbl> 4.66, NA, NA, 4~
$ review_scores_value                          <dbl> 4.75, NA, NA, 5~
$ license                                      <lgl> NA, NA, NA, NA,~
$ instant_bookable                             <lgl> FALSE, FALSE, F~
$ calculated_host_listings_count               <dbl> 2, 2, 1, 1, 1, ~
$ calculated_host_listings_count_entire_homes  <dbl> 2, 1, 0, 0, 0, ~
$ calculated_host_listings_count_private_rooms <dbl> 0, 1, 1, 1, 1, ~
$ calculated_host_listings_count_shared_rooms  <dbl> 0, 0, 0, 0, 0, ~
$ reviews_per_month                            <dbl> 0.55, NA, NA, 0~

Initial Preprocessing

Before we separate the data into two separate tables, we perform the following preprocessing steps on all entries:

  1. Convert all columns of the type date to the type character to help process NA values.
  2. Convert all of the different NA representations (blanks, None, N/A, NA) to NA.
  3. Convert all columns that contain dates back to the type data.
data <- data %>% 
        # (1) Convert dates to characters for NA values
        mutate(last_scraped = as.character(last_scraped),
               host_since = as.character(host_since),
               calendar_last_scraped = as.character(calendar_last_scraped),
               first_review = as.character(first_review),
               last_review = as.character(last_review),
               ) %>% 
        
        # (2) Homogenize NA values
        #*# Taken from: https://rpubs.com/Argaadya/create_table_sql
        mutate_all(function(x) ifelse(x == "" | x == "None" | x == "N/A", NA, x)) %>%  #*#
        # mutate_all(function(x) ifelse(is.na(x), "NULL", x)) %>% 
        
        # (3) Convert character strings back to date type
        mutate(last_scraped = as.Date(last_scraped),
               host_since = as.Date(host_since),
               calendar_last_scraped = as.Date(calendar_last_scraped),
               first_review = as.Date(first_review),
               last_review = as.Date(last_review))

We are now ready to move on to constructing our host_info table.

Host Table

Since a host can have many listings, it’s beneficial to split our initial data into two tables, one containing the info related to hosts and the other containing the info related to listings, before inserting into our database. The first table that we create, is the host_info table with the same attributes as shown in the E/R Diagram above.

Data Cleaning

We are now ready to extract and clean the data from our initial data table in order to construct our host_data table. This is done and the following four steps:

  1. Extract the relevant columns from our initial data table. Note that we use the : syntax to grab many columns at once. This syntax is inclusive.
  2. Remove duplicate rows using the distinct() function from the dplyr library.
  3. Convert the host_since column back to the type character. This is required since RSQLite does not support data of the type date. We will convert this column back to the correct type when performing queries later.
  4. Can we use the str_remove_all() function from the stringr library to convert the host_verifications sublists into simple strings.
e.g. “[‘email’, ‘phone’]” \(\rightarrow\) “email, phone”.
    # (1) Extract host data 
    host_data <- data %>% 
        select(host_id:host_identity_verified, 
               calculated_host_listings_count:calculated_host_listings_count_shared_rooms)
    
    
    # (2) Remove duplicate values
    host_data <- host_data %>% distinct()
    
    
    # (3) Convert dates
    # Note that this will need to converted back to type = date for analysis
    host_data  <- host_data %>% mutate(host_since = as.character(host_since)) 
    
    
    # (4) Clean host verification column
    host_data <- 
        host_data %>% 
        mutate(host_verifications = str_remove_all(host_verifications, "[\\'\\[\\]]"))

We can now view our clean data:

rmarkdown::paged_table(host_data)

Host_info Table Creation

Now that our host_data table is clean. We can create the equivalent table as a query, initially as a string listing the table’s columns, before creating the empty table in our database using our con object:

#################### Create table for host info
    query <- "CREATE TABLE host_info(
        host_id INT, 
        host_url VARCHAR(50), 
        host_name VARCHAR(100), 
        host_since VARCHAR(50),
        host_location VARCHAR(500), 
        host_about VARCHAR(10000),
        host_response_time VARCHAR(50),
        host_response_rate VARCHAR(50),
        host_acceptance_rate VARCHAR(50),
        host_is_superhost BOOLEAN,
        host_thumbnail_url VARCHAR(500),
        host_picture_url VARCHAR(500),
        host_neighbourhood VARCHAR(50),
        host_listings_count INT,
        host_total_listings_count INT,
        host_verifications VARCHAR(500),
        host_has_profile_pic BOOLEAN,
        host_identity_verified BOOLEAN,
        calculated_host_listings_count INT, 
        calculated_host_listings_count_entire_homes INT,
        calculated_host_listings_count_private_rooms INT,
        calculated_host_listings_count_shared_rooms INT,
        PRIMARY KEY(host_id)
        )"

Data Insertion

Now create the empty table in our database.

dbSendQuery(con, query)
<SQLiteResult>
  SQL  CREATE TABLE host_info(
        host_id INT, 
        host_url VARCHAR(50), 
        host_name VARCHAR(100), 
        host_since VARCHAR(50),
        host_location VARCHAR(500), 
        host_about VARCHAR(10000),
        host_response_time VARCHAR(50),
        host_response_rate VARCHAR(50),
        host_acceptance_rate VARCHAR(50),
        host_is_superhost BOOLEAN,
        host_thumbnail_url VARCHAR(500),
        host_picture_url VARCHAR(500),
        host_neighbourhood VARCHAR(50),
        host_listings_count INT,
        host_total_listings_count INT,
        host_verifications VARCHAR(500),
        host_has_profile_pic BOOLEAN,
        host_identity_verified BOOLEAN,
        calculated_host_listings_count INT, 
        calculated_host_listings_count_entire_homes INT,
        calculated_host_listings_count_private_rooms INT,
        calculated_host_listings_count_shared_rooms INT,
        PRIMARY KEY(host_id)
        )
  ROWS Fetched: 0 [complete]
       Changed: 0

Schema Verification

Next, we check the schema of our database so far to ensure that it was loaded correctly.

res <- dbSendQuery(con, "PRAGMA table_info([host_info]);")
fetch(res)
   cid                                         name           type
1    0                                      host_id            INT
2    1                                     host_url    VARCHAR(50)
3    2                                    host_name   VARCHAR(100)
4    3                                   host_since    VARCHAR(50)
5    4                                host_location   VARCHAR(500)
6    5                                   host_about VARCHAR(10000)
7    6                           host_response_time    VARCHAR(50)
8    7                           host_response_rate    VARCHAR(50)
9    8                         host_acceptance_rate    VARCHAR(50)
10   9                            host_is_superhost        BOOLEAN
11  10                           host_thumbnail_url   VARCHAR(500)
12  11                             host_picture_url   VARCHAR(500)
13  12                           host_neighbourhood    VARCHAR(50)
14  13                          host_listings_count            INT
15  14                    host_total_listings_count            INT
16  15                           host_verifications   VARCHAR(500)
17  16                         host_has_profile_pic        BOOLEAN
18  17                       host_identity_verified        BOOLEAN
19  18               calculated_host_listings_count            INT
20  19  calculated_host_listings_count_entire_homes            INT
21  20 calculated_host_listings_count_private_rooms            INT
22  21  calculated_host_listings_count_shared_rooms            INT
   notnull dflt_value pk
1        0         NA  1
2        0         NA  0
3        0         NA  0
4        0         NA  0
5        0         NA  0
6        0         NA  0
7        0         NA  0
8        0         NA  0
9        0         NA  0
10       0         NA  0
11       0         NA  0
12       0         NA  0
13       0         NA  0
14       0         NA  0
15       0         NA  0
16       0         NA  0
17       0         NA  0
18       0         NA  0
19       0         NA  0
20       0         NA  0
21       0         NA  0
22       0         NA  0

Host Data Insertion

Finally, insert our host_data table into the equivalent table in our RSQLite database using the insert_to_sql() function as defined above.

insert_to_sql(con, "host_info", host_data)

Database Verification

We can verify the contents of our newly created and populated RSQLite database table host_info by viewing the first 10 rows of each column as follows:

res <- dbSendQuery(con, "SELECT * FROM host_info LIMIT 10") 

out_db <- fetch(res) 
dbClearResult(res)

rmarkdown::paged_table(out_db)

Listing Table

We are now ready to move on to constructing our listing table. This table contains information related to the types of listings available and their relevant attributes. The process of constructing the listing table is very similar to the process of constructing the host_info table.

Data Cleaning

In order to prepare the relevant columns related to listing information we perform the following: 1. Remove all columns related to host_data. 2. Remove additional unnecessary columns. 3. Remove the dollar signs in the price column. 4. Perform the same transformation as we did above on the host_verifications column to the amenities column. 5. Convert all columns of the type date to the type character as is required for insertion into our database.

# (1) Remove host_data columns
listing_data <- data %>% 
    select( - names(host_data)[-1])


# (2) Remove extraneous columns 
listing_data <- listing_data %>% 
    select(-c(license, calendar_updated, bathrooms, scrape_id))


# (3) Remove dollar signs from price column
listing_data <- listing_data %>% 
    mutate(price = str_remove_all(price, "[$,]") %>% 
               as.numeric()
    )

# (4) Transform amenities and host verification column
listing_data <- listing_data %>% 
    mutate(amenities = str_remove_all(amenities, "[\"\\'\\[\\]]"))


# (5)  Convert dates to character
listing_data <- 
    listing_data %>% 
    mutate(last_scraped          = as.character(last_scraped), 
           calendar_last_scraped = as.character(calendar_last_scraped),
           first_review          = as.character(first_review),
           last_review           = as.character(last_review))

Listing Table Creation

Just as before, create our listing database table initially as a string to be inserted as a query.

#################### Create listing table
    query_2 <- [1856 chars quoted with '"']

Table Insertion

Insert the listing table into our database.

dbSendQuery(con, query_2)
<SQLiteResult>
  SQL  CREATE TABLE listing (
        id INT,
        listing_url VARCHAR(100),
        last_scraped VARCHAR(50),
        name VARCHAR(500),
        description VARCHAR(2000),
        neighborhood_overview VARCHAR(2000),
        neighbourhood VARCHAR(100),
        neighbourhood_cleansed VARCHAR(100),
        neighbourhood_group_cleansed VARCHAR(100),
        latitude DECIMAL(25,18),
        longitude DECIMAL(25, 18),
        property_type VARCHAR(100),
        room_type VARCHAR(100),
        picture_url VARCHAR(500),
        host_id INT,
        accommodates INT,
        bathrooms_text VARCHAR(100),
        bedrooms INT,
        beds INT,
        amenities VARCHAR(2000),
        price DECIMAL(15, 5),
        minimum_nights INT,
        maximum_nights INT,
        minimum_minimum_nights INT,
        maximum_minimum_nights INT,
        minimum_maximum_nights INT,
        maximum_maximum_nights INT,
        minimum_nights_avg_ntm DECIMAL(16, 5),
        maximum_nights_avg_ntm DECIMAL(16, 5),
        has_availability BOOLEAN,
        availability_30 INT,
        availability_60 INT,
        availability_90 INT,
        availability_365 INT,
        calendar_last_scraped VARCHAR(50),
        number_of_reviews INT,
        number_of_reviews_ltm INT,
        number_of_reviews_l30d INT,
        first_review VARCHAR(50),
        last_review VARCHAR(50),
        review_scores_rating DECIMAL(10, 5),
        review_scores_accuracy DECIMAL(10, 5),
        review_scores_cleanliness DECIMAL(10, 5),
        review_scores_checkin DECIMAL(10, 5),
        review_scores_communication DECIMAL(10, 5),
        review_scores_location DECIMAL(10, 5),
        review_scores_value DECIMAL(10, 5),
        instant_bookable BOOLEAN,
        reviews_per_month DECIMAL(10, 5),
        PRIMARY KEY(id),
        FOREIGN KEY(host_id) REFERENCES host_info(host_id)
    )
  ROWS Fetched: 0 [complete]
       Changed: 0

Data Insertion

Insert the listing data into our database using the same insert_to_sql() function as before.

insert_to_sql(con, "listing", listing_data)

Database Verification

Confirm that the data was indeed inserted correctly.

res <- dbSendQuery(con, "SELECT * FROM listing LIMIT 10")

out_db <- fetch(res)
dbClearResult(res)

rmarkdown::paged_table(out_db)

Database Disconnect

Finally, before exiting our program, disconnect from the database

Conclusion

We have provided an example of how to pull real-world data from an actual source, then separate, clean and load the data into a RSQLite database. Real-world data is often times initially messy, but that does not mean it cannot be wrangled into a form from which we can gain deeper insights. In our next post, we demonstrate how to answer some interesting questions given our newly created database.